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Abstract 


We  discuss  the  implementation  of  a  quadratic  transportation  algorithm  on  massively 
parallel  computer  architectures  with  hypercube  communication  networks.  The  imple¬ 
mentation  is  optimal  in  the  sense  that  it  requires  effectively  0(  )  operations  to 

perform  one  iteration  of  an  mo  x  mj)  problem  using  P  processors.  Peak  computing 
rates  of  3  GFLOPS  are  achieved  by  the  algorithm  on  a  64K  Connection  Machine  CM-2. 


1  Introduction 

We  consider  in  this  paper  quadratic  optimization  problems  over  transportation  constraints. 
Such  problems  find  applications  in  logistics,  traffic  management,  and  matrix  balancing 
models  in  economics  and  regional  planning.  The  dual,  row-action  algorithm  of  Zenios  and 
Censor  [1989]  is  well  suited  for  implementation  on  massively  parallel  computer  architectures. 
In  this  report  we  describe  an  optimal  implementation  of  this  algorithm  on  a  massively 
parallel  architecture  with  a  hypercube  communication  network.  The  implementation  is 
optimal  in  the  sense  that  it  requires  effectively  0(™Qp'Q  )  operations  to  perform  one  iteration 
of  an  mo  X  mo  problem  using  P  processors.  Computational  experiments  on  a  Connection 
Machine  CM-2  indicate  that  the  implementation  achieves  a  computing  rate  of  3  GFLOPS 
when  solving  dense  1024  X  1024  problems  on  a  64K  system. 

2  The  Quadratic  Transportation  Algorithm 

Let  <  m  >  denote  the  set  {1,2,3, ... ,m} .  Denote  by  the  m-dimensional  Euclidean 
space.  A  transportation  graph  is  defined  as  the  triplet  Q  =  [Vo,Vo,£),  where  Vo  =< 
m0  >,  Vq  =<  rno  >  and  £  C  {(i,j)  |  i  £  Vo,j  £  Vb}.  Vo  and  Vo  are  the  sets  of 
origin  and  destination  nodes  of  cardinality  mo  and  mo  respectively.  £  is  the  set  of  n 
directed  arcs  ( *,  j ) ,  with  origin  node  i  and  destination  node  j ,  which  belong  to  the  graph, 
n  <  momo  •  Let  also  x  =  ( x,j )  £  3Rn  be  the  vector  of  flows;  u  =  (u^)  £  9?n  be  the  vector 
of  upper  bounds  on  the  flows;  s  =  (s;)  £  3?m° ,  for  i  £  Vo,  be  the  vector  of  supplies; 
d  =  (dj)  £  ,  for  j  £  Vo,  be  the  vector  of  demands;  tt°  =  {icf )  6  ,  for  i  £  Vo ,  be 

the  vector  of  dual  prices  for  origin  nodes;  ttd  —  (*f)  £  ,  for  j  £  Vo,  be  the  vector  of 

dual  prices  for  destination  nodes;  r  =  (r\j)  £  3Jn  be  the  vector  of  dual  prices  for  the  bound 
constraints;  Sf  =  {j  £  Vo  |  {i,j)  £  £}  be  the  set  of  destination  nodes  which  have  arcs  with 
origin  node  t,  and  8j  =  {i  £  Vo  \  ( i,j )  £  £},  the  set  of  origin  nodes  which  have  arcs  with 
destination  node  j .  With  this  notation  we  define  the  quadratic  transportation  problem  as 
follows: 


Minimize  F(z) 

Subject  to  : 
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where  {u/,j}  and  {c<j}  are  given  positive  real  numbers.  The  following  dual,  row-action, 
algorithm  for  this  problem  was  developed  in  Zenios  and  Censor  [1989],  where  its  asymptotic 
convergence  was  established. 

Step  0:  (Initialization)  Set  k  «—  0.  Get  x°,  (x°)°,  (rD)°,  r°  such  that: 
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Step  1:  (Iterative  step  over  constraint  set  (2)). 
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Step  2:  (Iterative  step  over  constraint  set  (3)). 
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Step  3:  (Iterative  step  over  constraint  set  (4)). 
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Step  4:  Replace  Jb  <—  Jb  +  1  and  return  to  Step  1. 
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3  An  Optimal  Implementation  on  the  Connection  Machine 

The  key  to  the  implementation  of  the  algorithm  on  the  Connection  Machine  CM-2  hyper¬ 
cube  is  the  configuration  of  the  processing  elements  into  a  two-dimensional  grid.  If  the 
grid  size  is  larger  than  the  number  of  processing  elements  then  we  configure  the  machine 
into  an  mo  X  mo  grid  of  virtual  processors  (VP),  and  each  physical  processing  element 
performs  the  work  of  several  virtual  processors.  The  dimension  of  the  grid  is  set  equal  to 
the  number  of  origin  nodes  mo  rounded  up  to  the  nearest  integer  that  is  a  power  of  two, 


and  the  other  dimension  is  set  equal  to  the  number  of  destination  nodes  mo  rounded  up 
in  the  same  fashion.  Virtual  processors  outside  the  mo  x  mo  grid  are  disabled  and  do  not 
participate  in  the  computations.  The  memory  of  virtual  processor  with  grid  coordinates 
( t,j )  is  partitioned  into  the  following  data  fields:  (1)  Supply  and  demand,  s  and  d,  (2) 
dual  prices,  r° ,  irD  and  r ,  (3)  Upper  bound  u,  (The  lower  bound  l  is  assumed  to  be  equal 
to  zero  and  hence  it  is  treated  as  a  constant.),  (4)  current  iterate  x,  (5)  a  scaling  factor 
scale  used  to  store  p ,  a  or  intermediate  results,  and  (6)  two  fields  IW  and  OW  hold  the 
terms  v-. — —  and  r-. — 1—. —  respectively.  The  algorithm  is  executed  as  follows: 

22ief- !/«-.,  x;>e,+ 

Step  1:  A  spread-sum  operation  along  the  1-axis  of  the  grid  computes  the  partial  sums 
]T\64+  Xij  for  each  origin  node  (i.e.,  each  row  of  the  grid).  This  result  is  spread 
to  the  scale  memory  fields  of  all  VP  in  the  same  row.  A  sub-mult  operation  (i.e., 
an  optimized  implementation  of  a(x  —  b))  is  used  to  compute  the  scaling  factor  p 
(equation  (6)).  The  seeding  factor  updates  the  local  field  x  by  addition  and  the  local 
field  tt°  by  subtraction  (equations  (7)  and  (8)  respectively). 

Step  2:  This  step  is  similar  to  Step  1,  with  a  spread-sum  along  the  0-axis. 

Step  3:  A  combination  of  min  and  max  operations  computes  the  mid(.)  of  equation  (9),  which 
is  then  used  to  update  the  local  field  a:  by  a  mult-add  operation  (equation  (11))  and 
local  field  r  (equation  (11)). 

3.1  The  Hypercube  Implementation 

In  this  section  we  describe  how  Steps  1-3  of  the  algorithm  are  implemented  to  rim  on  P 
physical  processors  of  the  CM-2  hypercube,  so  that  the  execution  time  is  in  effect  0(  mgp>e- ). 
To  simplify  our  discussion,  assume  that  mo,  mo  and  \/P  are  powers  of  two.  Let  the 
vertices  of  the  hypercube  be  labeled  with  the  integers  0  through  P  —  1 ,  where  vertices  i 
and  j  of  the  cube  are  connected  with  a  wire  when  the  absolute  value  of  i  —  j  is  a  power  of 
two.  We  initially  configure  the  mo  X  mo  grid  of  virtual  processors  so  that  each  physical 
processor  is  assigned  to  do  the  work  for  an  ^  x  ^  subgrid  of  virtual  processors.  In  figure 
1  we  have  mo  =  32,  mo  =  32,  P  =  16  and  each  physical  processor  does  the  work  for  an 
8x8  subgrid  of  virtual  processors. 

In  our  configuration,  we  lay  out  the  physical  processors  in  a  \fP  X  \[P  two-dimensional 
grid.  Let  P,_,  be  the  the  physical  processor  that  is  in  row  i  and  column  j  of  the  grid.  We 
place  physical  processor  Pij  at  vertex  iy/P  +  j  of  the  cube.  In  the  example  of  figure  1,  the 
16  physical  processors  are  configured  in  a  4  by  4  grid,  and  each  physical  processor  is  placed 
at  vertex  4i  +  j  of  the  hypercube.  The  arcs  in  figure  2  illustrate  how  the  CM-2  hypercube 
connects  the  \fP  x  \fP  grid  of  physical  processors. 

Most  of  the  operations  in  Steps  1-3  require  no  communication  between  virtual  processors. 
For  these  operations,  each  virtual  processor  spends  0(1)  time  to  work  on  its  local  memory, 
and  each  physical  processor  performs  the  work  of  mop2-  virtual  processors.  Therefore  the 
local  operations  require  0(  m<2p}-a  )  time. 

The  operations  that  do  require  communications  are  the  spread-sums,  where  we  add  the 
elements  in  each  row  of  the  virtual  grid,  and  copy  the  results  back  to  every  processor  of  each 
virtual  row.  Technically,  a  spread-sum  is  an  operation  where  a  field  scaleij  is  allocated  in 
each  virtual  processor,  and  scaleij  =  YJk=axik  •  If  the  transportation  graph  is  not  complete, 
then  for  each  missing  edge  (i  j),  is  set  to  zero  before  performing  the  spread-sum.  In  the 

rest  of  this  section,  we  show  how  a  spread-sum  is  performed,  effectively,  in  0(  771  )  time. 


Let  Xij  ai,  denote  the  value  of  x  that  lies  at  position  (a,  b)  in  the  subgrid  of  physical 
processor  Pij.  In  figure  1,  physical  processor  Pi:  holds  the  values 
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A  spread-sum  is  performed  in  three  phases.  In  the  first  phase,  each  physical  processor 
finds  the  subtotal  for  each  row  in  its  subgrid.  For  example,  in  figure  1,  each  physical 
processor  performs  the  following  8  additions: 
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Once  these  additions  have  been  performed,  each  physical  processor  holds  a  column  vector  of 
subtotals  (the  shaded  elements  in  figure  1).  Obtaining  these  subtotals  requires  0(  m~Qp'D- ) 
time. 

Let  c*  denote  the  column  of  subtotals  that  lies  in  physical  processor  Pij.  In  the 
second  phase  of  the  spread-sum,  the  processors  in  each  physical  row  i  coordinate  to  perform 
efficiently  the  vector  sum  SUM i  =  Cio  4-  Cn  -1 - 4-  The  resulting  vector  SUMi 

holds  the  spread-sums  for  the  rows  of  the  virtual  grid  that  are  assigned  to  physical  row 
t.  In  figure  1,  each  vector  SUMi  would  hold  the  spread-sums  for  virtual  rows  8t  through 
Si  +  7. 

Now  note  that  the  physical  processors  P;o,  Pi i,  Pa,  ...  of  physical  row  i  all  lie  in  a 
physical  subcube  of  the  machine  that  spans  log2\/P  dimensions.  (In  figure  2,  the  thick  arcs 
show  how  each  row  of  the  physical  grid  spans  a  two-dimensional  subcube.)  Let  Cij denote 
the  fcth  element  in  column  vector  Cij,  and  let  SUMi>k  denote  the  Jfcth  element  in  column 
vector  SUMi.  Given  the  log2\/P -dimensional  subcube,  we  use  the  procedure  listed  below 
to  obtain  the  first  log 2\/P  elements  of  vector  SUMi .  (Note  that  the  CM-2  hypercube  wires 
are  bi-directional  and  can  be  activated  simultaneously.) 

Let  t  =  log 2>/P; 

For  k  =  0  through  l  —  1  do 

(51)  Let  each  processor  Pij  send 

along  dimension  ( k  -f  0)  mod  t  of  the  subcube, 

Cijt\  along  dimension  (k  4- 1)  mod  l  of  the  subcube, 

Cijtt-i  along  dimension  (k  +  t  -  l)  mod  t  of  the  subcube; 

(52)  Let  each  processor  Pi:  receive 

a  value  tempo  along  dimension  (k  4-  0)  mod  l  of  the  subcube, 

a  value  tempi  along  dimension  (ife  4- 1)  mod  i  of  the  subcube, 

a  value  tempt-\  along  dimension  (k  4- 1  -  1)  mod  l  of  the  subcube; 

(53)  For  5  =  0  through  t  -  1  do 

Let  Cijtq  —  Cijq  4"  tempq ; 


Problem 

Itns. 

VP 

ratio 

Real  time 
(seconds) 

CM  time 
(seconds) 

Real  GFLOPS 
(64K  CM-2) 

CM  GFLOPS 
(64K  CM-2) 

TEST1 

5000 

256 

556.49 

532.92 

3.012 

3. 

1 

48 

TEST2 

4600 

128 

270.24 

255.62 

2.856 

3. 

0 

19 

TEST2 

4000 

256 

730.00 

676.00 

1.839 

1.986 

Table  1:  Performance  of  the  quadratic  transportation  algorithm  using  1024  x  1024.  First 
and  second  rows  report  results  with  the  optimal  implementation  of  the  algorithm.  Third 
row  reports  results  using  C/Paris  instructions  from  the  CM-2  library,  release  5.2. 


At  the  end  of  this  procedure,  each  value  *  for  k  =  0,...,/  -  1  equals  SUM{  j,.  If  we 
repeat  this  procedure  for  consecutive  groups  of  t  elements  along  vector  C{j ,  then  eventually 
each  vector  Cij  equals  SUM{ ,  which  completes  phase  two  of  the  spread- sum. 

Each  execution  of  steps  (SI)  and  (S2)  requires  0(1)  time.  On  a  Connection  Machine,  the 
execution  time  for  the  loop  in  step  (S3)  is  negligible  when  compared  to  the  communication 
time  in  steps  (Si)  and  (S2);  so  we  treat  the  loop  in  step  (S3)  as  an  0(1)  operation.  The 
outer  loop  of  this  procedure  iterates  t  times;  so  the  whole  procedure  requires  0(£)  time  to 
execute.  The  procedure  itself  is  repeated  for  groups  of  l  elements;  so  phase  two  of  the 
spread  sum  requires  0(  ^ )  time. 

If  step  (S3)  of  the  above  procedure  was  not  dominated  by  steps  (Si)  and  (S2),  then  we 
would  have  to  say  that  phase  two  of  the  spread-sum  requires  0(  ^log2 y/P)  time.  In  that 

case,  we  could  remove  the  log 2'/P  factor  by  using  a  straightforward  variant  of  the  more 
complex  one-to-all  broadcast  algorithm  that  is  given  by  Johnsson  and  Ho  [1989]. 

In  phase  three  of  the  spread-sum,  we  copy  the  column  vector  C.i  (which  now  equals 
SUMi )  across  processor  subgrid,  so  that  each  virtual  processor  holds  the  result  of  a 

spread-sum.  This  phase  requires  0(  mgplg  )  time.  Adding  the  execution  times  for  the  three 
phases  of  the  spread-sum,  we  get  an  execution  time  that  is  effectively  0(  ). 

3.2  Numerical  Results 

We  implemented  the  algorithm  of  Section  2  as  explained  in  Section  3  in  C/Paris,  using 
CMIS  instructions,  on  a  CM-2  with  32-bit  floating  point  accelerators.  The  performance 
of  the  parallel  implementation  was  evaluated  using  two  randomly  generated  test  problems 
of  dimension  1024  x  1024.  The  implementation  uses  virtual  processors  at  a  ratio  of  256 
and  128  virtual  processors  per  physical  processor,  on  a  4K  and  8K  CM-2  respectively.  The 
computing  rate  of  the  algorithm  under  different  virtual  processing  ratios  is  summarized  in 
Table  1.  The  last  row  of  the  same  table  provides,  as  a  benchmark,  the  performance  of  a 
C/Paris  implementation  of  the  algorithm  without  using  the  techniques  of  section  3.  Observe 
that  the  computing  rate  estimated  based  on  CM  time  is  consistently  in  excess  of  3  GFLOPS. 
Computing  rates  estimated  based  on  real  time  exceed  3  GFLOPS  for  higher  virtual  process¬ 
ing  ratios.  This  discrepancy  was  anticipated  since  the  first  phase  of  spread-sum  requires  no 
communications,  while  hypercube  communications  are  needed  in  the  second  phase  of  the 
operation.  The  C /Paris  implementation  is  significantly  slower  than  the  optimal  implemen¬ 
tation.  The  difference  in  the  number  of  iterations  between  the  two  runs  for  problem  TEST2 
is  due  to  the  difference  in  the  terminal  tolerance  specified.  Both  implementations  achieve 
identical  final  error  in  exactly  the  same  number  of  iterations. 
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Figure  1:  Virtual  Processor  Configuration.  Figure  2:  Physical  Processor  Configuration. 

4  Conclusions 

Gigaflop  performance  has  been  quite  common  in  several  areas  of  large  scale  scientific  com¬ 
puting  using  massively  (and  other)  parallel  architectures.  Unfortunately  such  performance 
is  difficult  to  achieve  in  numerical  optimization.  Optimization  problems  do  not  usually  have 
nice  structures  at  a  micro  level,  and  optimization  algorithms  need  frequent  communications. 
We  have  shown  that  for  some  well  structured  problems  we  may  overcome  the  communica¬ 
tion  bottleneck  and  achieve  performance  in  the  range  of  3  GFLOPS.  An  implementation 
that  is  effectively  optimal  was  achieved  using  a  simpler  scheme  than  the  one  of  Johnsson 
and  Ho  [1989]. 
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